Policy Improvement for POMDPs Using Normalized Importance Sampling
Author
Abstract
We present a new method for estimating the expected return of a POMDP from experience. The estimator does not assume any knowledge of the POMDP, can estimate the returns for finite state controllers, allows experience to be gathered from arbitrary sequences of policies, and estimates the return for any new policy. We motivate the estimator from function-approximation and importance sampling points of view and derive its bias and variance. Although the estimator is biased, it has low variance and the bias is often irrelevant when the estimator is used for pair-wise comparisons. We conclude by extending the estimator to policies with memory and compare its performance in a greedy search algorithm to the REINFORCE algorithm, showing an order of magnitude reduction in the number of trials required.
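To make the construction concrete, the core of such a normalized importance sampling return estimate for reactive policies can be sketched as follows; the trajectory layout, field names, and `target_policy` interface are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def normalized_is_return(trajectories, target_policy):
    """Normalized importance sampling estimate of the expected return of
    `target_policy` from trajectories gathered under other (behavior) policies.

    Each trajectory is assumed to be a dict with:
      'obs'            : list of observations o_t
      'actions'        : list of actions a_t
      'behavior_probs' : probabilities the behavior policy assigned to a_t given o_t
      'return'         : observed (discounted) return of the trajectory
    `target_policy(o, a)` returns the probability the new policy assigns to a given o.
    """
    weights, returns = [], []
    for traj in trajectories:
        w = 1.0
        for o, a, b in zip(traj['obs'], traj['actions'], traj['behavior_probs']):
            w *= target_policy(o, a) / b        # per-step likelihood ratio
        weights.append(w)
        returns.append(traj['return'])
    weights = np.asarray(weights)
    returns = np.asarray(returns)
    # Normalizing by the sum of weights (rather than by the number of
    # trajectories) introduces a bias but typically lowers the variance.
    return np.sum(weights * returns) / np.sum(weights)
```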
Similar Resources
Importance Sampling Estimates for Policies with Memory
Importance sampling has recently become a popular method for computing off-policy Monte Carlo estimates of returns. It has been known that importance sampling ratios can be computed for POMDPs when the sampled and target policies are both reactive (memoryless). We extend that result to show how they can also be efficiently computed for policies with memory state (finite state controllers) witho...
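A minimal sketch of the quantity involved is below, assuming one particular finite state controller parameterization (initial memory distribution, observation-driven memory transitions, memory-conditioned action probabilities). The importance ratio for a trajectory is this probability under the target controller divided by the same probability under the behavior policy.

```python
import numpy as np

def fsc_action_sequence_prob(obs, actions, init_dist, trans, act_prob):
    """Probability a finite state controller assigns to an action sequence
    given an observation sequence, marginalizing over its hidden memory states
    with a forward recursion.

    init_dist[m]    : P(initial memory state m)
    trans[m, o, m2] : P(next memory state m2 | memory state m, observation o)
    act_prob[m, a]  : P(action a | memory state m)
    (This parameterization is an illustrative assumption.)
    """
    alpha = init_dist.copy()                      # filtering distribution over memory states
    prob = 1.0
    for o, a in zip(obs, actions):
        step = alpha * act_prob[:, a]             # weight each memory state by P(a_t | m_t)
        prob_t = step.sum()                       # P(a_t | actions and observations so far)
        prob *= prob_t
        alpha = (step @ trans[:, o, :]) / prob_t  # condition on a_t, then transition on o_t
    return prob
```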
Monte Carlo Sampling Methods for Approximating Interactive POMDPs
Partially observable Markov decision processes (POMDPs) provide a principled framework for sequential planning in uncertain single agent settings. An extension of POMDPs to multiagent settings, called interactive POMDPs (I-POMDPs), replaces POMDP belief spaces with interactive hierarchical belief systems which represent an agent’s belief about the physical world, about beliefs of other agents, ...
The Cross-Entropy Method for Policy Search in Decentralized POMDPs
Decentralized POMDPs (Dec-POMDPs) are becoming increasingly popular as models for multiagent planning under uncertainty, but solving a Dec-POMDP exactly is known to be an intractable combinatorial optimization problem. In this paper we apply the Cross-Entropy (CE) method, a recently introduced method for combinatorial optimization, to Dec-POMDPs, resulting in a randomized (sampling-based) algor...
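As a rough illustration of the approach (not the paper's specific construction), a generic cross-entropy loop over a real-valued policy parameterization looks like this, where `evaluate` is assumed to return an estimated joint return for a candidate Dec-POMDP policy:

```python
import numpy as np

def cross_entropy_policy_search(evaluate, n_params, iterations=50,
                                samples=100, elite_frac=0.1):
    """Generic cross-entropy search: sample candidate policies from a Gaussian,
    keep the best-scoring ones, and refit the sampling distribution to them."""
    mean, std = np.zeros(n_params), np.ones(n_params)
    n_elite = max(1, int(samples * elite_frac))
    for _ in range(iterations):
        thetas = np.random.randn(samples, n_params) * std + mean
        scores = np.array([evaluate(t) for t in thetas])
        elite = thetas[np.argsort(scores)[-n_elite:]]      # highest estimated returns
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean
```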
An Empirical Analysis of Off-policy Learning in Discrete MDPs
Off-policy evaluation is the problem of evaluating a decision-making policy using data collected under a different behaviour policy. While several methods are available for addressing off-policy evaluation, little work has been done on identifying the best methods. In this paper, we conduct an in-depth comparative study of several off-policy evaluation methods in non-bandit, finite-hor...
A comparative study of counterfactual estimators
We provide a comparative study of several widely used off-policy estimators (Empirical Average, Basic Importance Sampling and Normalized Importance Sampling), detailing the different regimes where they are individually suboptimal. We then exhibit properties optimal estimators should possess. In the case where examples have been gathered using multiple policies, we show that fused estimators dom...
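For reference, with per-example importance weights w_i (the likelihood ratio of the target policy over the logging policy) and logged rewards r_i, the three estimators compared there reduce to the following sketch; the function names are illustrative:

```python
import numpy as np

def empirical_average(rewards):
    """Ignores the policy mismatch entirely: the plain mean of logged rewards."""
    return np.mean(rewards)

def basic_importance_sampling(rewards, weights):
    """Unbiased but often high-variance: mean of weight * reward."""
    return np.mean(np.asarray(weights) * np.asarray(rewards))

def normalized_importance_sampling(rewards, weights):
    """Biased but lower-variance: weighted mean with the weights renormalized."""
    w = np.asarray(weights)
    return np.sum(w * np.asarray(rewards)) / np.sum(w)
```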
Journal:
Volume, Issue:
Pages: -
Publication year: 2001